PLSC30500, Fall 2024

Part 3. Learning from random samples (part b)

Andy Eggers

Recapping

Say our estimand \(\theta\) is a population mean, i.e. \({\textrm E}[X]\)

We have a plug-in estimator \(\hat{\theta}\), the sample mean \(\overline{X}_{(n)}\).

\(\overline{X}\) has a sampling distribution. What do we know about it?

  • centered on \({\textrm E}[X]\), i.e. unbiased: \({\textrm E}[\overline{X}] = {\textrm E}[X]\)
  • variance is \({\textrm V}[X]/n\), which we can approximate with \(\hat{{\textrm V}}[X]/n\)
  • by CLT, asymptotically normal
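
A quick simulation can illustrate all three facts (a sketch; the exponential population, sample size, and seed are arbitrary choices):

```r
# simulate the sampling distribution of the mean for X ~ Exponential(1),
# so that E[X] = 1 and V[X] = 1
set.seed(60637)
means <- replicate(5000, mean(rexp(n = 200, rate = 1)))
mean(means)  # close to E[X] = 1 (unbiasedness)
sd(means)    # close to sqrt(V[X]/n) = sqrt(1/200), about 0.071
hist(means)  # roughly normal, as the CLT predicts
```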

What can we do with this knowledge?

All-knowing perspective

Suppose we knew \({\textrm E}[X]\) and \({\textrm V}[X]\).

Then we know the asymptotic sampling distribution of \(\overline{X}\), i.e. \(N\left({\textrm E}[X], \frac{{\textrm V}[X]}{n}\right)\).

We could then compute, e.g. the probability of observing a sample mean below 2 if \({\textrm E}[X] = 2.104\), \({\textrm V}[X] = 4.63\), and \(n = 198\).

pnorm(q = 2, mean = 2.104, sd = sqrt(4.63/198))
[1] 0.2482193

Similarly if we only knew \({\textrm E}[X]\) but not \({\textrm V}[X]\).

The researcher’s perspective

But we don’t know \({\textrm E}[X]\) (that’s why we’re doing research).

We have an estimate \(\overline{X}\), and we can estimate \({\textrm V}[X]\) from the sample. What more can we say about \({\textrm E}[X]\)?

Reporting \(\hat{\sigma}[\overline{X}] = \sqrt{\frac{\hat{{\textrm V}}[X]}{n}}\), the standard error, is a start.

Approaches we will take:

  • what is an interval centered on \(\overline{X}\) that will contain \({\textrm E}[X]\) in at least 95% of samples? (confidence interval)
  • suppose \({\textrm E}[X] = c\). what is the probability of observing \(\overline{X}\) at least as far from \(c\) as what we observed? (p-value)

The researcher’s perspective (2)

Generally, these tools of statistical inference require

  • having a plug-in estimator for \(\theta\), i.e. \(\hat{\theta}\)
  • having a plug-in estimator for \({\textrm V}[\hat{\theta}]\), i.e. \(\hat{{\textrm V}}[\hat{\theta}]\)
  • conditions for approximately normal sampling distribution of \(\hat{\theta}\)
    • “mild regularity conditions”
    • “large” \(n\)

These tools do not require any other assumptions about the underlying data (e.g. normality).

\(\implies\) “agnostic statistics”.

Confidence intervals

Confidence interval motivation

Could we specify a range that is likely (e.g. 95% likely) to include \(\theta\)?

  • Frequentist interpretation: 95% of the time we construct such a range from a sample, it will include \(\theta\)
  • Bayesian interpretation: I believe with 95% certainty that this range includes \(\theta\)

That is the goal of constructing a confidence interval.


Lazy confidence intervals that are certain to include \(\theta\):

  • for a proportion: \([0, 1]\)
  • for GDP per capita: \([0, \infty)\)
  • for average growth in income: \((-\infty, \infty)\)

We seek smaller ones that are likely to include \(\theta\).

Confidence interval definition

An interval \(CI\) is a valid confidence interval for \(\theta\) with coverage \((1 - \alpha)\) if

\[\text{Pr}[\theta \in CI] \geq 1 - \alpha\]

Typical to choose \(\alpha = .05\), so the CI’s coverage is .95.

In the frequentist view, \(\theta\) is fixed and \(CI\) is the random variable; the probability statement is about repeated samples.

CI construction (1)

Suppose we know that \(\hat{\theta}\) is distributed normally with mean \(\theta\) and variance \(\sigma^2\) (i.e. \(N(\theta, \sigma^2)\)).

For now, suppose we know \(\theta\). (Remember, in real life we don’t.)

What is the shortest interval \([a,b]\) that will contain \(\hat{\theta}\) 95% of the time?

CI construction (2)

Because \(\hat{\theta}\) is normally distributed, the shortest interval \([a,b]\) that will contain \(\hat{\theta}\) 95% of the time is \[[\theta - 1.96 \sigma, \theta + 1.96 \sigma]\]

For 90% interval, \[[\theta - 1.64 \sigma, \theta + 1.64 \sigma]\]

Where do these numbers come from?

# for 95% CI
qnorm(.025)
[1] -1.959964
qnorm(.975)
[1] 1.959964
pnorm(1.96) - pnorm(-1.96)
[1] 0.9500042
# for 90% CI
qnorm(.05)
[1] -1.644854
qnorm(.95)
[1] 1.644854
pnorm(1.64) - pnorm(-1.64)
[1] 0.8989948

Demonstration

Everyone use R to draw a single value from a normal distribution with mean 4 and sd 2.

What proportion of draws are

  • outside the interval \([4 - 1.96 \times 2, 4 + 1.96 \times 2] = [0.08, 7.92]\)?
  • outside the interval \([4 - 1.64 \times 2, 4 + 1.64 \times 2] = [.72, 7.28]\)?
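
One way to check this with many draws at once (a sketch; 100,000 draws is an arbitrary choice):

```r
set.seed(60637)
draws <- rnorm(n = 100000, mean = 4, sd = 2)
# proportion outside the 95% interval [0.08, 7.92]
mean(draws < 0.08 | draws > 7.92)  # close to .05
# proportion outside the 90% interval [0.72, 7.28]
mean(draws < 0.72 | draws > 7.28)  # close to .10
```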

CI construction (3)

We have an interval that contains 95% of \(\hat{\theta}\) draws, given \(\theta\) and \(\sigma\).

We want an interval that contains \(\theta\) 95% of the time, given \(\hat{\theta}\) and \(\hat{\sigma}\).

Consider this interval:

\[\left[ \hat{\theta} - 1.96 \hat{\sigma}, \hat{\theta} + 1.96 \hat{\sigma} \right] \]

We can construct it without knowing \(\theta\), and (asymptotically) it contains \(\theta\) 95% of the time!
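
We can check the coverage claim by simulation (a sketch, reusing the population from the demonstrations: mean 4, sd 2):

```r
set.seed(60637)
covered <- replicate(2000, {
  samp <- rnorm(n = 400, mean = 4, sd = 2)
  hat_theta <- mean(samp)
  hat_sigma <- sd(samp)/sqrt(400)
  # does [hat_theta - 1.96*hat_sigma, hat_theta + 1.96*hat_sigma] contain 4?
  abs(hat_theta - 4) <= 1.96 * hat_sigma
})
mean(covered)  # close to .95
```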

CI construction (4)

Demonstration 2

Everyone use R to draw a sample of size \(n=400\) from a normal distribution with mean 4 and sd 2. Using your sample, make a 90% confidence interval for the population mean:

  • get \(\hat{\theta}\) and \(\hat{\sigma}\)
  • make \(\left[ \hat{\theta} - 1.64 \hat{\sigma}, \hat{\theta} + 1.64 \hat{\sigma} \right]\)

Does your CI include \(4\) (the population mean)?

Demonstration 2 (code)

samp <- rnorm(n = 400, mean = 4, sd = 2)
hat_theta <- mean(samp)
hat_sigma <- sd(samp)/sqrt(400)  # = sqrt(var(samp)/400)
# lower bound
hat_theta - 1.64*hat_sigma
[1] 3.804415
# upper bound
hat_theta + 1.64*hat_sigma
[1] 4.126901

Illustration: 90% confidence interval

Illustration: 99% confidence interval

Interpretation of confidence intervals

As a frequentist concept, the confidence interval is about a long-run average: if I make many 95% (valid) confidence intervals, 95% of them will contain the true value.

Recall, for a valid 95% CI, \(\text{Pr}(\theta \in \text{CI}) \geq 95\%\).

\(\theta\) is not a random variable; this is a probability statement about the frequency of the CI including \(\theta\), not your beliefs about where \(\theta\) is.

But in the absence of other information, a Bayesian would say “There is a 95% probability that \(\theta\) is in this CI.”

Hypothesis testing and p-values

Hypothesis testing: motivation

With confidence intervals, we report an interval centered on estimate \(\hat{\theta}^*\) that is likely (in either frequentist or Bayesian sense) to contain the estimand \(\theta\).

Another approach: hypothesis testing.

Basic idea:

  • specify a null hypothesis \(\theta_0\): a possible value of \(\theta\) (typically one you’re arguing against)
  • say how unlikely your result (or a more extreme one) would be if null hypothesis were true (\(p\)-value)

The more unlikely your result \(\hat{\theta}^*\) would be under the null (i.e. the lower the \(p\)-value), the more doubtful the null hypothesis appears.

The logic of hypothesis testing

Similar to proof by contradiction (modus tollens):

  • “If \(A\), then \(B\); but \(B\) is false, so \(A\) is false.”
  • “If he loved me, then he would have called; he didn’t call, so he doesn’t love me.”

But it’s a probabilistic version (weak syllogism):

  • “If \(A\), then \(B\) likely; but \(B\) is false, so \(A\) becomes less likely”
  • “If he loved me, then he probably would have called; the fact that he didn’t call makes me more doubtful that he loves me”

The latter conclusion is logically warranted (by Bayes’ rule) if \(\text{Pr}(\text{no call} \mid \text{love}) < \text{Pr}(\text{no call} \mid \text{no love})\).

But the conclusion that “he probably doesn’t love me” is not – it depends on how confident you were of his love before.

Bayes rule, again

We have:

\[\begin{align} \text{Pr}[\text{love} \mid \text{no call}] &= \frac{\text{Pr}[\text{no call} \mid \text{love}] \text{Pr}[\text{love}]}{\text{Pr}[\text{no call}]}\\ \text{Pr}[\text{no love} \mid \text{no call}] &= \frac{\text{Pr}[\text{no call} \mid \text{no love}] \text{Pr}[\text{no love}]}{\text{Pr}[\text{no call}]} \end{align}\]

The ratio between them:

\[\overbrace{\frac{\text{Pr}[\text{love} \mid \text{no call}]}{\text{Pr}[\text{no love} \mid \text{no call}]}}^{\text{Posterior odds}} = \overbrace{\frac{ \text{Pr}[\text{no call} \mid \text{love}] }{\text{Pr}[\text{no call} \mid \text{no love}] }}^{\text{Likelihood ratio}} \overbrace{\frac{\text{Pr}[\text{love}]}{\text{Pr}[\text{no love}]}}^{\text{Prior odds}}\]

Posterior odds lower than prior odds \(\iff\) likelihood ratio \(< 1\).
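
With made-up numbers, the update is easy to compute (all probabilities below are hypothetical, for illustration only):

```r
p_nocall_love   <- 0.2  # Pr[no call | love] (hypothetical)
p_nocall_nolove <- 0.5  # Pr[no call | no love] (hypothetical)
prior_odds <- 1         # Pr[love]/Pr[no love]: 50-50 beforehand
likelihood_ratio <- p_nocall_love / p_nocall_nolove
(posterior_odds <- likelihood_ratio * prior_odds)
# 0.4: lower than the prior odds, since the likelihood ratio is below 1
```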

\(p\)-value logic (1)

Remember: for large enough \(n\), a plug-in estimator \(\hat{\theta}\) is normally distributed around estimand \(\theta\) if

  • unbiased (or consistent), and
  • “mild regularity conditions” hold

But we don’t know \(\theta\).

\(p\)-value logic (2)

But we can say, “Suppose \(\theta = \theta_0\)”.

In that case, we know that*

\[\hat{\theta} \sim N(\theta_0, \hat{\sigma}^2)\] and we can estimate the probability that \(\hat{\theta}\) would be in any interval, given \(\theta_0\).

*Asymptotically, and if \(\text{V}[\hat{\theta} \mid \theta = \theta_0] = \text{V}[\hat{\theta}]\)

The lower one-tailed \(p\)-value

Lower one-tailed p-value:

\[\text{Pr}[\hat{\theta} \leq \hat{\theta}^*] \, \, \text{assuming} \,\, \theta = \theta_0\] i.e. “under the null”.

The upper one-tailed \(p\)-value

Upper one-tailed p-value:

\[\text{Pr}[\hat{\theta} \geq \hat{\theta}^*] \, \, \text{assuming} \,\, \theta = \theta_0\] i.e. “under the null”.

The two-tailed \(p\)-value

Two-tailed p-value: \(\text{Pr}\left[\lvert\hat{\theta} - \theta_0\rvert \geq \lvert\hat{\theta}^* - \theta_0\rvert \right]\) assuming \(\theta = \theta_0\), i.e. “under the null”.

Computing \(p\)-values (1)

It is useful to transform our estimate \(\hat{\theta}^*\) into a \(t\)-statistic:

\[t = \frac{\hat{\theta}^* - \theta_0}{\sqrt{\hat{\text{V}}[\hat{\theta}]}} \]

In words: the difference between the estimate and the null, divided by the standard error of the estimator.

\(|t|\) gets bigger when

  • estimate gets further from null, or
  • estimator gets more precise

If the null is true (\(\theta = \theta_0\)), then asymptotically \(t \sim N(0, 1)\).

Computing \(p\)-values (2)

Suppose \(t^*\) (the observed \(t\)) is \(-1.5\).

Since asymptotically \(t \sim N(0, 1)\) under null hypothesis, we can compute asymptotically valid lower one-tailed \(p\)-value as follows:

my_t <- -1.5
pnorm(my_t) 
[1] 0.0668072

Computing \(p\)-values (3)

Suppose \(t^* = 1.5\).

We can compute asymptotically valid upper one-tailed \(p\)-value as follows:

my_t <- 1.5
1 - pnorm(my_t) 
[1] 0.0668072

Computing \(p\)-values (4)

We can compute asymptotically valid two-tailed \(p\)-value as follows:

pnorm(-abs(my_t)) + 1 - pnorm(abs(my_t)) 
[1] 0.1336144
2*(pnorm(-abs(my_t)))
[1] 0.1336144
2*(1 - pnorm(abs(my_t)))
[1] 0.1336144

One-tailed vs two-tailed

  • Lower one-tailed \(p\)-value answers question, “How likely is it that I would get a value at least as low as \(\hat{\theta}^*\) (\(t^*\)) if \(\theta = \theta_0\)?” (similar for upper)
  • Two-tailed \(p\)-value answers question, “How likely is it that I would get a value at least as extreme as \(\hat{\theta}^*\) (\(t^*\)) if \(\theta = \theta_0\)?”


In principle, both are interesting. The one-tailed version is especially relevant if the null hypothesis is really that \(\theta \geq \theta_0\).


In practice, one-tailed tests are used only when the test is pre-registered. Otherwise we use two-tailed tests to be conservative, given the way \(p\)-values are used in testing.

Null hypothesis significance testing (NHST)

Convention (credited to Fisher) is: “Reject null hypothesis if \(p < .05\).” When the null is true, rejection should occur 5% of the time.

Rejection of null hypothesis commonly interpreted (by seminar audiences, reviewers, editors, hiring committees) as “finding something”.

So researchers really want to reject null.

Many “best practices” are about minimizing cheating:

  • corrections for multiple hypothesis tests (problem set 2)
  • pre-analysis plans that specify how analysis will be run
  • requirement to share data and code (replication archive)

In this context, one-tailed p-values are seen as cheating.

Interpretation of p-values

Intuitively, a low \(p\)-value means “if the null hypothesis (that \(\theta = \theta_0\)) were true, we would infrequently encounter a result as extreme as the one that we saw. Therefore, if we reject the null hypothesis (that is, if we conclude that \(\theta \neq \theta_0\)) based solely on how extreme the result is, then that decision will be a mistake either infrequently (if \(\theta = \theta_0\)) or never (if \(\theta \neq \theta_0\)).” (Aronow & Miller, p. 128)


So how often will it be a mistake? i.e. what is \(\text{Pr}[\theta = \theta_0 \mid \text{reject}]\) (shifting to Bayesian perspective)?


Based on above, sounds like our rejections are incorrect “between infrequently (\(\alpha\)) and never”.

Interpretation of p-values (2)

What is \(\text{Pr}[\theta = \theta_0 \mid \text{reject}]\) (probability a rejection is a mistake)?

Use Bayes’ Rule (problem set 2): \[\begin{align} \text{Pr}[\theta = \theta_0 \mid \text{reject}] &= \frac{\text{Pr}[\text{reject} \mid \theta = \theta_0 ] \text{Pr}[\theta = \theta_0]}{\text{Pr}[\text{reject} \mid \theta = \theta_0 ] \text{Pr}[\theta = \theta_0] + \text{Pr}[\text{reject} \mid \theta \neq \theta_0 ] \text{Pr}[\theta \neq \theta_0]} \\ &= \frac{\alpha p_0}{ \alpha p_0 + \text{Power} (1 - p_0)} \end{align}\] where \(\alpha = \text{Pr}[\text{reject} \mid \theta = \theta_0 ]\), \(p_0 = \text{Pr}[\theta = \theta_0]\) and \(\text{Power} = \text{Pr}[\text{reject} \mid \theta \neq \theta_0 ]\).

Suppose \(\alpha = .05\) (standard) and \(p_0 = .5\) (good chance \(\theta = \theta_0\)).

Then

  • if Power = .8 (standard target), \(\text{Pr}[\theta = \theta_0 \mid \text{reject}] \approx .06\)
  • if Power = .05 (very bad), \(\text{Pr}[\theta = \theta_0 \mid \text{reject}] = .5\)
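
These numbers can be checked directly (a sketch; the function name is mine):

```r
# Pr[theta = theta_0 | reject], via Bayes' rule
pr_mistaken_rejection <- function(alpha, power, p0) {
  alpha * p0 / (alpha * p0 + power * (1 - p0))
}
pr_mistaken_rejection(alpha = .05, power = .8,  p0 = .5)  # about .06
pr_mistaken_rejection(alpha = .05, power = .05, p0 = .5)  # .5
```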

What is going on (1)?

Suppose 200 tests will be performed, and \(\text{Pr}[\theta = \theta_0] = .5\).

Good situation (power = .8):

  • 100 cases where \(\theta = \theta_0\), with 5 rejections.
  • 100 cases where \(\theta \neq \theta_0\), with 80 rejections.

Then only \(5/85 \approx .059\) of the rejections were mistakes.


Bad situation (power = .05):

  • 100 cases where \(\theta = \theta_0\), with 5 rejections.
  • 100 cases where \(\theta \neq \theta_0\), with 5 rejections.

Then \(5/10 = .5\) of the rejections were mistakes.

What is going on (2)?

Intuitively, … “if we reject the null hypothesis…based solely on how extreme the result is, then that decision will be a mistake either infrequently (if \(\theta = \theta_0\)) or never (if \(\theta \neq \theta_0\)).” (Aronow & Miller, p. 128; emphasis added)


If “that decision” means “rejecting the null hypothesis”, then “infrequently” is wrong: \[\text{Pr}[\text{rejection is mistake} \mid \theta = \theta_0] = 1\]


If “that decision” means “rejecting only if the result is sufficiently extreme”, then “never” is wrong, because the rule can err by failing to reject: \[\text{Pr}[\text{fail to reject} \mid \theta \neq \theta_0] > 0\]

More on interpreting \(p\)-values

Note that a high \(p\)-value does not offer the same guarantees for those looking to accept a null hypothesis and is accordingly limited in its utility for decision making. (Aronow & Miller, p. 128)

Is this true? What do we learn from a high \(p\)-value?

Again Bayes’ Rule says it depends on the likelihood ratio:

\[\begin{align} \frac{\text{Pr}[ \theta = \theta_0 \mid \text{high p-value}]}{\text{Pr}[ \theta = \theta_1 \mid \text{high p-value}]} &= \frac{\frac{\text{Pr}[ \text{high p-value} \mid \theta = \theta_0 ] \text{Pr}[ \theta = \theta_0 ]}{\text{Pr}[ \text{high p-value}]}}{\frac{\text{Pr}[ \text{high p-value} \mid \theta = \theta_1 ] \text{Pr}[ \theta = \theta_1 ]}{\text{Pr}[ \text{high p-value}]}} \\ &= \frac{\text{Pr}[ \text{high p-value} \mid \theta = \theta_0 ]}{\text{Pr}[ \text{high p-value} \mid \theta = \theta_1 ]} \frac{\text{Pr}[ \theta = \theta_0 ]}{\text{Pr}[ \theta = \theta_1 ]} \end{align}\]

If a high \(p\)-value is much more likely under the null than when \(\theta = \theta_1\), then \(\theta_0\) may become much more plausible compared to \(\theta_1\).

Interpreting p-values: conclusion

  • interpreting \(p\)-values is hard
  • lower \(p\)-value casts more doubt on null hypothesis
  • saying more (e.g. about share of rejections that are wrong, probability null is true) requires more information
  • use Bayes’ rule

The bootstrap

Bootstrap motivation

We know that plug-in estimators \(\hat{\theta}\) are asymptotically normally distributed (under mild regularity conditions).

In many cases we can prove unbiasedness (\({\textrm E}[\hat{\theta}] = \theta\)) or consistency (\(\hat{\theta} \rightarrow \theta\)), as with the sample variance.


But what about the variance \({\textrm V}[\hat{\theta}]\)? (necessary for CIs, \(p\)-values)

We proved that \({\textrm V}[\overline{X}] = \frac{{\textrm V}[X]}{n}\) given iid samples, but what about

  • other estimators?
  • non-iid samples?

Examples: the correlation between two variables, the ratio of two means.


The bootstrap is a very general solution for estimating \({\textrm V}[\hat{\theta}]\).

Bootstrap basics

\({\textrm V}[\hat{\theta}]\) describes the variance of \(\hat{\theta}\) across samples of size \(n\).

Problem: We have only one sample of size \(n\).

Bootstrap solution:

  • Generate \(m\) artificial samples of size \(n\) by resampling from our sample with replacement
  • Estimate \({\textrm V}[\hat{\theta}]\) using the variance of \(\hat{\theta}\) across these artificial samples

Bootstrap illustration

samp <- c(4,2,5,3,6,6)
mean(samp)
[1] 4.333333
(resamp1 <- sample(samp, size = length(samp), replace = T))
[1] 2 5 2 2 6 3
mean(resamp1)
[1] 3.333333
(resamp2 <- sample(samp, size = length(samp), replace = T))
[1] 2 4 3 3 6 4
mean(resamp2)
[1] 3.666667
(resamp3 <- sample(samp, size = length(samp), replace = T))
[1] 5 6 6 5 6 5
mean(resamp3)
[1] 5.5

Why does this work?

The bootstrap is a plug-in estimator!

  • The estimand is now \({\textrm V}[\hat{\theta}]\), the sampling variance
  • If we had the population, we could compute \({\textrm V}[\hat{\theta}]\) by resampling from population
  • We don’t have the population, so we “plug in” the sample and resample from that
  • As \(n\) increases, the sample looks more like the population, so the bootstrap estimate of \({\textrm V}[\hat{\theta}]\) gets closer to the estimand

Compare to our approach to estimating \({\textrm V}[\overline{X}]\) previously:

  • Analytically determine that \({\textrm V}[\overline{X}] = \frac{{\textrm V}[X]}{n}\)
  • Use plug-in principle to estimate \({\textrm V}[X]\): compute var() in sample instead of population

We don’t need the bootstrap for \({\textrm V}[\overline{X}]\), but it will work for (almost) anything.

Bootstrap example

Let’s get a 95% confidence interval for the mean of env (jobs/environment tradeoff) in the 2012 CCES.

dat <- read.csv("./../data/cces_2012_subset.csv")

Earlier, we learned this approach to estimating \(\sigma[\overline{X}]\):

std_error <- sqrt(var(dat$env, na.rm = T)/sum(!is.na(dat$env)))
std_error
[1] 0.01406871

This also works:

env <- dat$env[!is.na(dat$env)]
sd(env)/sqrt(length(env))
[1] 0.01406871

The bootstrap approach:

  • resample \(n\) rows with replacement from dat \(m\) times
  • compute \(m\) sample means
  • compute standard deviation across them
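
A sketch of that loop (the CCES file isn’t reproduced here, so a simulated stand-in for the env column, with NAs already removed, is used):

```r
set.seed(60637)
env <- rnorm(n = 500, mean = 3.2, sd = 0.6)  # stand-in for dat$env
m <- 1000
samp_means <- replicate(m, {
  resamp <- sample(env, size = length(env), replace = TRUE)
  mean(resamp)
})
sd(samp_means)  # bootstrap standard error of the mean
```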

Bootstrap example (2)

I stored \(m=1000\) sample means in samp_means.

(bootstrap_std_error <- sd(samp_means))
[1] 0.013834

Very close to the other solution!

Now for the 95% confidence intervals:

# using "analytical" approach (V[X]/n)
mean(dat$env, na.rm = T) + 1.96*std_error*c(-1, 1)
[1] 3.168247 3.223397
# using bootstrap
mean(dat$env, na.rm = T) + 1.96*bootstrap_std_error*c(-1, 1)
[1] 3.168707 3.222937

Using the bootstrap

Our estimand \(\theta\): correlation between env and aa in 2012 CCES (CCES is population)

Our estimator \(\hat{\theta}\): correlation in a sample of \(n\) rows

Variance of our estimator \({\textrm V}[\hat{\theta}]\):

  • True value can be approximated by repeated samples from population
  • How to estimate from sample:
    • Analytical solution (analogous to \({\textrm V}[X]/n\) for sample mean)?
    • Bootstrap!
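
A sketch of the bootstrap for a correlation (simulated stand-in data, since the CCES columns aren’t reproduced here):

```r
set.seed(60637)
n <- 500
x <- rnorm(n)            # stand-in for env
y <- 0.3 * x + rnorm(n)  # stand-in for aa, correlated with x
boot_cors <- replicate(2000, {
  rows <- sample(n, size = n, replace = TRUE)  # resample row indices
  cor(x[rows], y[rows])
})
sd(boot_cors)  # bootstrap estimate of the standard error of cor(x, y)
```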

Application of bootstrap (1)

  1. Repeated samples from population: draw 5000 samples of size 500 from full CCES; compute cor(env, aa) in each one
  2. Repeated resamples from sample: start with sample of size 500 from CCES; draw 5000 resamples; compute cor(env, aa) in each one

Application of bootstrap (2)

  1. Repeated samples from population: draw 5000 samples of size 1500 from full CCES; compute cor(env, aa) in each one
  2. Repeated resamples from sample: start with sample of size 1500 from CCES; draw 5000 resamples; compute cor(env, aa) in each one

Varieties of bootstrap

Naive bootstrap: Resample rows of the dataset with replacement (i.e. the method above).

Block bootstrap: For grouped data (e.g. students in schools), resample groups rather than dataset rows

Bayesian bootstrap: Keep rows of dataset same but draw random reweightings

Residual bootstrap: Keep rows of dataset same but resample residuals, i.e.

  • Fit a model for \(Y\)
  • Compute fitted value \(\hat{y}_i\) and residual \(\hat{e}_i\) for each observation, where \(y_i = \hat{y}_i + \hat{e}_i\)
  • Resample residuals with replacement, so \(y_i\) becomes \(\hat{y}_i + \hat{e}_j\)

Wild bootstrap (unrestricted): Keep rows of dataset same but rescale residuals, e.g. by draws from \(\{-1, 1\}\) (equal probability) or \(N(0, 1)\)
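
A sketch of the block bootstrap for grouped data (the data frame and column names below are made up):

```r
set.seed(60637)
# made-up grouped data: 20 schools with 25 students each
grp <- data.frame(school = rep(1:20, each = 25), y = rnorm(500))
schools <- unique(grp$school)
boot_means <- replicate(1000, {
  drawn <- sample(schools, size = length(schools), replace = TRUE)
  resamp <- do.call(rbind, lapply(drawn, function(s) grp[grp$school == s, ]))
  mean(resamp$y)
})
sd(boot_means)  # standard error from resampling schools, not rows
```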

Randomization inference

Motivation

So far, we have focused on uncertainty that comes from sampling from a population: we don’t know about the units not in the sample.

  • Target of inference: Population quantity (e.g. population average, regression coefficient)
  • Source of uncertainty: Missing rows


In causal inference, we also care about uncertainty that comes from the assignment of treatment: we don’t know some of the potential outcomes, i.e. outcomes for a given unit if it had each treatment

  • Target of inference: Sample quantity (e.g. sample average treatment effect)
  • Source of uncertainty: Missing data (missing potential outcomes)

Randomization inference: procedure

Sharp null hypothesis is that treatment does not affect outcomes.

Under the sharp null, we do know the missing potential outcomes: they are equal to the observed outcomes.

So:

  • randomly reshuffle the treatment
  • compute the estimated treatment effect under the reshuffled assignment
  • store and repeat

Get a \(p\)-value by comparing observed treatment effect to distribution of treatment effects under sharp null.
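
The procedure can be sketched in a few lines (the outcome and treatment are simulated, and the true effect of 0.5 is an arbitrary choice):

```r
set.seed(60637)
n <- 100
z <- sample(rep(c(0, 1), each = n/2))  # random treatment assignment
y <- rnorm(n) + 0.5 * z                # outcome with a true effect of 0.5
obs_effect <- mean(y[z == 1]) - mean(y[z == 0])
null_effects <- replicate(5000, {
  z_perm <- sample(z)  # reshuffle the treatment
  # under the sharp null, y would be unchanged by the reshuffling
  mean(y[z_perm == 1]) - mean(y[z_perm == 0])
})
# two-tailed p-value: how often is the reshuffled effect as extreme as observed?
mean(abs(null_effects) >= abs(obs_effect))
```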